
Add score trace URL to evaluation run #558

Merged
vprashrex merged 19 commits into main from feat/eval-results-s3-storage
Feb 6, 2026

Conversation


@vprashrex vprashrex commented Jan 25, 2026

Issue: #494

Summary

  • Add score_trace_url field to evaluation run model
  • Update evaluation service to populate trace URL

Test plan

  • Run migrations
  • Verify evaluation runs capture trace URLs

Summary by CodeRabbit

  • New Features

    • Traces can be uploaded in JSON or JSONL, stored in cloud, and lazily loaded; when cloud traces exist only summaries are kept in the DB, with resync option forcing live refetch.
  • Bug Fixes / Reliability

    • Improved error handling and logging for cloud load/upload, with DB fallback when cloud operations fail.
  • Tests

    • Added tests covering upload/download success/failure, DB fallback, summary-only storage, and resync behavior.
  • Chores

    • Database schema extended to record cloud trace URLs.


coderabbitai bot commented Jan 25, 2026

Warning

Rate limit exceeded

@vprashrex has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 1 minute and 53 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📝 Walkthrough

Walkthrough

Adds object-store-backed trace upload/load (JSON/JSONL), a nullable score_trace_url DB column and migration, and updates save/get evaluation flows to prefer S3 traces with DB fallback and Langfuse resync behavior.

Changes

  • Storage Utilities — backend/app/core/storage_utils.py
    Added load_json_from_object_store(); extended upload_jsonl_to_object_store() with format: Literal["json","jsonl"], format-aware serialization, content-type handling, and expanded error logging.
  • Evaluation CRUD — backend/app/crud/evaluations/core.py
    update_evaluation_run() now accepts score_trace_url; save_score() attempts object-store upload of traces, stores only the summary in the DB on upload success, falls back to storing the full score on failure, and propagates score_trace_url.
  • Database Model — backend/app/models/evaluation.py
    Added a nullable score_trace_url field to the EvaluationRun model (string, default None, with column comment).
  • Evaluation Service — backend/app/services/evaluations/evaluation.py
    When fetching traces, attempts an S3 load via score_trace_url first, then DB traces, and finally a Langfuse resync; merges summary_scores/traces and persists via save_score.
  • Migration — backend/app/alembic/versions/044_add_score_trace_url_to_evaluation_run.py
    Alembic migration adding/dropping the score_trace_url column on the evaluation_run table, with comment.
  • Tests: storage utils — backend/app/tests/core/test_storage_utils.py
    New tests for upload_jsonl_to_object_store() and load_json_from_object_store() covering success and error/invalid-JSON paths.
  • Tests: save_score — backend/app/tests/crud/evaluations/test_score_storage.py
    Tests for save_score() S3 upload behavior: successful upload stores summary only, failure stores the full score, and no-op when traces are empty.
  • Tests: evaluation service S3 — backend/app/tests/services/evaluations/test_evaluation_service_s3.py, backend/app/tests/services/evaluations/__init__.py
    Tests covering S3 trace loading, DB trace fallback, and resync bypass to Langfuse; added fixtures and test cases for these flows.
  • Manifest — requirements.txt, pyproject.toml
    Minor manifest adjustments referenced by the diff summary.
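The format-aware serialization described for upload_jsonl_to_object_store() can be sketched as follows. This is a simplified, self-contained illustration, not the real helper: the parameter is named output_format here to avoid shadowing the format builtin, and the actual function also handles the storage client, filenames, and error logging.

```python
import json
from typing import Literal


def serialize_results(
    results: list[dict],
    output_format: Literal["json", "jsonl"] = "jsonl",
) -> tuple[bytes, str]:
    """Serialize a list of dicts as JSON Lines or a JSON array, with content type."""
    if output_format == "jsonl":
        # One JSON document per line (JSONL)
        content = "\n".join(json.dumps(item, ensure_ascii=False) for item in results)
        return content.encode("utf-8"), "application/jsonl"
    # A single JSON array document
    return json.dumps(results, ensure_ascii=False).encode("utf-8"), "application/json"
```

The two formats matter because OpenAI batch responses are naturally JSONL (one record per line) while a trace dump is more convenient as one JSON array.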

Sequence Diagram

sequenceDiagram
    participant Client
    participant EvalService as Evaluation Service
    participant DB as Database
    participant S3 as Object Storage
    participant Langfuse as Langfuse API

    Client->>EvalService: get_evaluation_with_scores(resync_score=false)
    EvalService->>DB: get_evaluation_run_by_id()
    DB-->>EvalService: eval_run (may include score_trace_url)

    alt score_trace_url present
        EvalService->>S3: load_json_from_object_store(score_trace_url)
        S3-->>EvalService: traces JSON
        EvalService->>EvalService: merge traces into eval_run.score
        EvalService-->>Client: return eval_run with S3 traces
    else no score_trace_url but DB traces present
        EvalService->>EvalService: use DB-cached traces from eval_run.score
        EvalService-->>Client: return eval_run with DB traces
    else resync required
        EvalService->>Langfuse: fetch_trace_scores_from_langfuse()
        Langfuse-->>EvalService: traces + summary_scores
        EvalService->>EvalService: merge and prepare score
        EvalService->>DB: save_score(merged_score) (may upload traces to S3)
        DB-->>EvalService: persisted (and possibly score_trace_url returned)
        EvalService-->>Client: return eval_run with fresh traces
    end
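The fallback order in the diagram can be approximated in plain Python. This is a hypothetical sketch, not the actual service code; resolve_traces, load_from_s3, and fetch_from_langfuse are illustrative stand-ins for the real functions:

```python
from typing import Callable, Optional


def resolve_traces(
    score_trace_url: Optional[str],
    db_traces: Optional[list[dict]],
    resync: bool,
    load_from_s3: Callable[[str], Optional[list[dict]]],
    fetch_from_langfuse: Callable[[], list[dict]],
) -> tuple[list[dict], str]:
    """Return (traces, source) following the S3 -> DB -> Langfuse order."""
    if not resync:
        if score_trace_url:
            traces = load_from_s3(score_trace_url)
            if traces is not None:
                return traces, "s3"
        # Independent check: if the S3 load failed, DB-cached traces still apply
        if db_traces:
            return db_traces, "db"
    # Resync requested, or no cached traces anywhere: refetch live
    return fetch_from_langfuse(), "langfuse"
```

Note the DB check is a separate `if`, not an `elif`, so a failed S3 load still falls back to DB-cached traces before resorting to a live Langfuse fetch.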

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested reviewers

  • nishika26
  • Prajna1999
  • AkhileshNegi

Poem

🐇 I hopped to the cloud with a grin,
Traces tucked safe, stored deep within,
JSON or JSONL — neat and spry,
DB keeps the summary, S3 holds the sky,
A rabbit cheers — uploads fly high! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 76.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'Add score trace URL to evaluation run' clearly and concisely summarizes the main change: adding a new score_trace_url field to the EvaluationRun model and implementing the logic to populate it across the codebase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




codecov bot commented Jan 25, 2026

Codecov Report

❌ Patch coverage is 96.05263% with 9 lines in your changes missing coverage. Please review.

Files with missing lines:
  • backend/app/crud/evaluations/core.py — patch 83.33%, 4 lines missing ⚠️
  • backend/app/core/storage_utils.py — patch 87.50%, 3 lines missing ⚠️
  • backend/app/services/evaluations/evaluation.py — patch 90.00%, 2 lines missing ⚠️


@vprashrex vprashrex self-assigned this Jan 27, 2026
@vprashrex vprashrex added the enhancement New feature or request label Jan 27, 2026
@vprashrex vprashrex linked an issue Jan 27, 2026 that may be closed by this pull request
@vprashrex vprashrex marked this pull request as ready for review January 31, 2026 18:20

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/core/storage_utils.py (1)

86-123: ⚠️ Potential issue | 🟡 Minor

Align JSONL metadata with JSON array payload.

The payload is now a JSON array, but the docstring and content-type still say JSONL. Update the metadata to avoid confusion (or revert to JSONL formatting).

🧾 Suggested update
-    Upload JSONL (JSON Lines) content to object store.
+    Upload JSON array content to object store.
@@
-        results: List of dictionaries to be converted to JSONL
+        results: List of dictionaries to be serialized as a JSON array
@@
-        headers = Headers({"content-type": "application/jsonl"})
+        headers = Headers({"content-type": "application/json"})
backend/app/crud/evaluations/core.py (1)

184-209: ⚠️ Potential issue | 🟡 Minor

Document the new score_trace_url parameter.

The Args section doesn’t mention the newly added parameter.

✍️ Suggested update
-        object_store_url: New object store URL (optional)
-        score: New score dict (optional)
-        embedding_batch_job_id: New embedding batch job ID (optional)
+        object_store_url: New object store URL (optional)
+        score_trace_url: New per-trace score URL (optional)
+        score: New score dict (optional)
+        embedding_batch_job_id: New embedding batch job ID (optional)
🤖 Fix all issues with AI agents
In `@backend/app/alembic/versions/043_add_score_trace_url_to_evaluation_run.py`:
- Around line 19-32: The migration functions upgrade and downgrade lack return
type hints; update the function signatures for upgrade() and downgrade() to
declare return type -> None so they read as upgrade() -> None: and downgrade()
-> None: (keep existing bodies unchanged) to satisfy the type-hinting rule.

In `@backend/app/core/storage_utils.py`:
- Around line 151-152: The log statement in load_json_from_object_store has a
missing closing quote and brace in the f-string; update the logger.info call
(logger.info(...)) to properly close the string and include the url variable,
e.g., fix the f-string to something like f"[load_json_from_object_store] Loading
JSON from '{url}'" so the message is well-formed and includes the trailing quote
and bracket.

In `@backend/app/services/evaluations/evaluation.py`:
- Around line 233-267: The code in get_evaluation_with_scores treats S3 and DB
as mutually exclusive (uses `elif`), so if `eval_run.score_trace_url` exists but
`load_json_from_object_store` fails or returns None you never fall back to DB
traces (`has_cached_traces_db`); change the control flow so the DB check is a
separate conditional executed after the S3 attempt/exception (i.e., replace the
`elif has_cached_traces_db` with an independent `if has_cached_traces_db` and
ensure the S3-exception handler does not return or suppress the subsequent DB
check), using the existing symbols `eval_run`, `score_trace_url`,
`has_cached_traces_db`, `load_json_from_object_store`, and
`get_evaluation_with_scores` to locate and update the logic.

In `@backend/app/tests/core/test_storage_utils.py`:
- Around line 57-98: The test methods in TestLoadJsonFromObjectStore lack
parameter type hints; update each method signature (test_load_success,
test_load_returns_none_on_error, test_load_returns_none_on_invalid_json) to
include the implicit self parameter type (self: "TestLoadJsonFromObjectStore")
and an explicit return type -> None; ensure any local test helper parameters
would also be annotated (e.g., mock_storage: MagicMock) if present, and keep the
body unchanged so assertions against load_json_from_object_store and
mock_storage.stream continue to work.
- Around line 16-55: The test methods in class TestUploadJsonlToObjectStore lack
type annotations on the implicit self parameter; update the method signatures
for test_upload_success and test_upload_returns_none_on_error to annotate self
(e.g., self: TestUploadJsonlToObjectStore) while keeping the existing return
type hints, and run tests to ensure no behavioral changes; locate these methods
and the class name TestUploadJsonlToObjectStore and add the self parameter type
annotation accordingly.

In `@backend/app/tests/crud/evaluations/test_score_storage.py`:
- Around line 14-110: The fixture mock_eval_run should be a factory (callable)
and the test functions need parameter type hints; change the fixture name to
mock_eval_run_factory (or make mock_eval_run return a Callable[[], MagicMock])
that when called creates/returns a MagicMock(spec=EvaluationRun) with id=100,
then inside tests call mock_eval_run_factory() to get the eval_run; also add
type annotations to each test parameter (e.g., mock_engine: MagicMock,
mock_get_storage: MagicMock, mock_upload: MagicMock, mock_get_eval: MagicMock,
mock_update: MagicMock, mock_eval_run_factory: Callable[[], EvaluationRun] or
MagicMock) and return type -> None for
test_uploads_traces_to_s3_and_stores_summary_only,
test_fallback_to_db_when_s3_fails, and test_no_s3_upload_when_no_traces; keep
references to save_score and update_evaluation_run/get_evaluation_run_by_id
unchanged.

In `@backend/app/tests/services/evaluations/test_evaluation_service_s3.py`:
- Around line 42-120: The tests lack type hints: add annotations to the test
methods' parameters and fixtures referenced in this file—annotate mock
parameters (mock_get_storage, mock_load, mock_get_eval, mock_get_langfuse,
mock_fetch_langfuse, mock_save_score) as MagicMock, annotate fixture parameters
completed_eval_run_with_s3 and completed_eval_run_with_db_traces as
EvaluationRun, and annotate self as "TestGetEvaluationWithScoresS3" on the
methods test_loads_traces_from_s3, test_returns_db_traces_when_no_s3_url, and
test_resync_bypasses_cache_and_fetches_langfuse; also add return type
annotations for the fixtures (completed_eval_run_with_s3 -> EvaluationRun,
completed_eval_run_with_db_traces -> EvaluationRun) and import MagicMock and
EvaluationRun types where needed.
- Around line 14-37: Replace the two near-duplicate fixtures
completed_eval_run_with_s3 and completed_eval_run_with_db_traces with a single
factory fixture (e.g., eval_run_factory) that accepts parameters like id: int,
status: str, score: dict, score_trace_url: Optional[str], dataset_name:
Optional[str], run_name: Optional[str] and returns a
MagicMock(spec=EvaluationRun) populated accordingly; update tests to call
eval_run_factory(…) to create the two scenarios. Add explicit type hints to the
fixture and returned value (eval_run_factory: Callable[..., MagicMock] or ->
MagicMock) and to all test function parameters (e.g., mock_get_storage:
MagicMock, eval_run_factory: Callable[..., MagicMock]) so every fixture/test
signature is typed. Ensure existing symbols EvaluationRun,
completed_eval_run_with_s3 and completed_eval_run_with_db_traces are
removed/replaced and all tests updated to use the new eval_run_factory and typed
parameters.
🧹 Nitpick comments (1)
backend/app/crud/evaluations/core.py (1)

386-392: Resolve the TODO about DB storage fallback.

This TODO leaves behavior undecided and includes a typo (“weather” → “whether”). Please either track it in an issue or implement the intended behavior.

If you want, I can draft the follow-up change or open an issue template entry.

Comment on lines +19 to +32
def upgrade():
    op.add_column(
        "evaluation_run",
        sa.Column(
            "score_trace_url",
            sqlmodel.sql.sqltypes.AutoString(),
            nullable=True,
            comment="S3 URL where per-trace evaluation scores are stored",
        ),
    )


def downgrade():
    op.drop_column("evaluation_run", "score_trace_url")

⚠️ Potential issue | 🟡 Minor

Add return type hints to migration functions.

Both upgrade() and downgrade() should declare -> None to comply with the type-hinting rule.

✍️ Suggested update
-def upgrade():
+def upgrade() -> None:
     op.add_column(
         "evaluation_run",
         sa.Column(
             "score_trace_url",
             sqlmodel.sql.sqltypes.AutoString(),
             nullable=True,
             comment="S3 URL where per-trace evaluation scores are stored",
         ),
     )

-def downgrade():
+def downgrade() -> None:
     op.drop_column("evaluation_run", "score_trace_url")

As per coding guidelines, **/*.py: Always add type hints to all function parameters and return values in Python code.

🤖 Prompt for AI Agents
In `@backend/app/alembic/versions/043_add_score_trace_url_to_evaluation_run.py`
around lines 19 - 32, The migration functions upgrade and downgrade lack return
type hints; update the function signatures for upgrade() and downgrade() to
declare return type -> None so they read as upgrade() -> None: and downgrade()
-> None: (keep existing bodies unchanged) to satisfy the type-hinting rule.

Comment on lines 151 to 152
def load_json_from_object_store(storage: CloudStorage, url: str) -> list | None:
logger.info(f"[load_json_from_object_store] Loading JSON from '{url}")

⚠️ Potential issue | 🟡 Minor

Fix log message formatting.

The opening quote isn’t closed, which makes the log line confusing.

✍️ Suggested update
-    logger.info(f"[load_json_from_object_store] Loading JSON from '{url}")
+    logger.info(f"[load_json_from_object_store] Loading JSON from '{url}'")
🤖 Prompt for AI Agents
In `@backend/app/core/storage_utils.py` around lines 151 - 152, The log statement
in load_json_from_object_store has a missing closing quote and brace in the
f-string; update the logger.info call (logger.info(...)) to properly close the
string and include the url variable, e.g., fix the f-string to something like
f"[load_json_from_object_store] Loading JSON from '{url}'" so the message is
well-formed and includes the trailing quote and bracket.

Comment on lines 16 to 55
class TestUploadJsonlToObjectStore:
    """Test uploading JSON content to object store."""

    def test_upload_success(self) -> None:
        """Verify successful upload returns URL and content is correct."""
        mock_storage = MagicMock()
        mock_storage.put.return_value = "s3://bucket/path/traces.json"

        results = [{"trace_id": "t1", "score": 0.9}]

        url = upload_jsonl_to_object_store(
            storage=mock_storage,
            results=results,
            filename="traces.json",
            subdirectory="evaluations/score/70",
        )

        assert url == "s3://bucket/path/traces.json"
        mock_storage.put.assert_called_once()

        # Verify content is valid JSON
        call_args = mock_storage.put.call_args
        upload_file = call_args.kwargs.get("source")
        content = upload_file.file.read().decode("utf-8")
        assert json.loads(content) == results

    def test_upload_returns_none_on_error(self) -> None:
        """Verify returns None on CloudStorageError."""
        mock_storage = MagicMock()
        mock_storage.put.side_effect = CloudStorageError("Upload failed")

        url = upload_jsonl_to_object_store(
            storage=mock_storage,
            results=[{"id": 1}],
            filename="test.json",
            subdirectory="test",
        )

        assert url is None


⚠️ Potential issue | 🟡 Minor

Add parameter type hints to test methods.

Annotate self to satisfy the type-hint requirement for function parameters.

✍️ Suggested update
-    def test_upload_success(self) -> None:
+    def test_upload_success(self: "TestUploadJsonlToObjectStore") -> None:
         """Verify successful upload returns URL and content is correct."""
         mock_storage = MagicMock()
         mock_storage.put.return_value = "s3://bucket/path/traces.json"
@@
-    def test_upload_returns_none_on_error(self) -> None:
+    def test_upload_returns_none_on_error(
+        self: "TestUploadJsonlToObjectStore",
+    ) -> None:
         """Verify returns None on CloudStorageError."""
         mock_storage = MagicMock()
         mock_storage.put.side_effect = CloudStorageError("Upload failed")

As per coding guidelines, **/*.py: Always add type hints to all function parameters and return values in Python code.

🤖 Prompt for AI Agents
In `@backend/app/tests/core/test_storage_utils.py` around lines 16 - 55, The test
methods in class TestUploadJsonlToObjectStore lack type annotations on the
implicit self parameter; update the method signatures for test_upload_success
and test_upload_returns_none_on_error to annotate self (e.g., self:
TestUploadJsonlToObjectStore) while keeping the existing return type hints, and
run tests to ensure no behavioral changes; locate these methods and the class
name TestUploadJsonlToObjectStore and add the self parameter type annotation
accordingly.

Comment on lines 57 to 98
class TestLoadJsonFromObjectStore:
    """Test loading JSON from object store."""

    def test_load_success(self) -> None:
        """Verify successful load returns parsed JSON."""
        mock_storage = MagicMock()
        test_data = [{"id": 1, "value": "test"}]
        mock_storage.stream.return_value = BytesIO(
            json.dumps(test_data).encode("utf-8")
        )

        result = load_json_from_object_store(
            storage=mock_storage,
            url="s3://bucket/path/file.json",
        )

        assert result == test_data
        mock_storage.stream.assert_called_once_with("s3://bucket/path/file.json")

    def test_load_returns_none_on_error(self) -> None:
        """Verify returns None on CloudStorageError."""
        mock_storage = MagicMock()
        mock_storage.stream.side_effect = CloudStorageError("Download failed")

        result = load_json_from_object_store(
            storage=mock_storage,
            url="s3://bucket/file.json",
        )

        assert result is None

    def test_load_returns_none_on_invalid_json(self) -> None:
        """Verify returns None on invalid JSON content."""
        mock_storage = MagicMock()
        mock_storage.stream.return_value = BytesIO(b"not valid json")

        result = load_json_from_object_store(
            storage=mock_storage,
            url="s3://bucket/file.json",
        )

        assert result is None

⚠️ Potential issue | 🟡 Minor

Add parameter type hints to test methods.

Same type-hint requirement applies to this class’ methods.

✍️ Suggested update
-    def test_load_success(self) -> None:
+    def test_load_success(self: "TestLoadJsonFromObjectStore") -> None:
         """Verify successful load returns parsed JSON."""
         mock_storage = MagicMock()
         test_data = [{"id": 1, "value": "test"}]
@@
-    def test_load_returns_none_on_error(self) -> None:
+    def test_load_returns_none_on_error(
+        self: "TestLoadJsonFromObjectStore",
+    ) -> None:
         """Verify returns None on CloudStorageError."""
         mock_storage = MagicMock()
         mock_storage.stream.side_effect = CloudStorageError("Download failed")
@@
-    def test_load_returns_none_on_invalid_json(self) -> None:
+    def test_load_returns_none_on_invalid_json(
+        self: "TestLoadJsonFromObjectStore",
+    ) -> None:
         """Verify returns None on invalid JSON content."""
         mock_storage = MagicMock()
         mock_storage.stream.return_value = BytesIO(b"not valid json")

As per coding guidelines, **/*.py: Always add type hints to all function parameters and return values in Python code.

🤖 Prompt for AI Agents
In `@backend/app/tests/core/test_storage_utils.py` around lines 57 - 98, The test
methods in TestLoadJsonFromObjectStore lack parameter type hints; update each
method signature (test_load_success, test_load_returns_none_on_error,
test_load_returns_none_on_invalid_json) to include the implicit self parameter
type (self: "TestLoadJsonFromObjectStore") and an explicit return type -> None;
ensure any local test helper parameters would also be annotated (e.g.,
mock_storage: MagicMock) if present, and keep the body unchanged so assertions
against load_json_from_object_store and mock_storage.stream continue to work.

Comment on lines 14 to 110
@pytest.fixture
def mock_eval_run(self):
    """Create a mock EvaluationRun."""
    eval_run = MagicMock(spec=EvaluationRun)
    eval_run.id = 100
    return eval_run

@patch("app.crud.evaluations.core.update_evaluation_run")
@patch("app.crud.evaluations.core.get_evaluation_run_by_id")
@patch("app.core.storage_utils.upload_jsonl_to_object_store")
@patch("app.core.cloud.storage.get_cloud_storage")
@patch("app.core.db.engine")
def test_uploads_traces_to_s3_and_stores_summary_only(
    self,
    mock_engine,
    mock_get_storage,
    mock_upload,
    mock_get_eval,
    mock_update,
    mock_eval_run,
) -> None:
    """Verify traces uploaded to S3, only summary_scores stored in DB."""
    mock_get_eval.return_value = mock_eval_run
    mock_get_storage.return_value = MagicMock()
    mock_upload.return_value = "s3://bucket/traces.json"

    score = {
        "summary_scores": [{"name": "accuracy", "avg": 0.9}],
        "traces": [{"trace_id": "t1"}],
    }

    save_score(eval_run_id=100, organization_id=1, project_id=1, score=score)

    # Verify upload was called with traces
    mock_upload.assert_called_once()
    assert mock_upload.call_args.kwargs["results"] == [{"trace_id": "t1"}]

    # Verify DB gets summary only, not traces
    call_kwargs = mock_update.call_args.kwargs
    assert call_kwargs["score"] == {
        "summary_scores": [{"name": "accuracy", "avg": 0.9}]
    }
    assert call_kwargs["score_trace_url"] == "s3://bucket/traces.json"

@patch("app.crud.evaluations.core.update_evaluation_run")
@patch("app.crud.evaluations.core.get_evaluation_run_by_id")
@patch("app.core.storage_utils.upload_jsonl_to_object_store")
@patch("app.core.cloud.storage.get_cloud_storage")
@patch("app.core.db.engine")
def test_fallback_to_db_when_s3_fails(
    self,
    mock_engine,
    mock_get_storage,
    mock_upload,
    mock_get_eval,
    mock_update,
    mock_eval_run,
) -> None:
    """Verify full score stored in DB when S3 upload fails."""
    mock_get_eval.return_value = mock_eval_run
    mock_get_storage.return_value = MagicMock()
    mock_upload.return_value = None  # S3 failed

    score = {
        "summary_scores": [{"name": "accuracy", "avg": 0.9}],
        "traces": [{"trace_id": "t1"}],
    }

    save_score(eval_run_id=100, organization_id=1, project_id=1, score=score)

    # Full score stored in DB as fallback
    call_kwargs = mock_update.call_args.kwargs
    assert call_kwargs["score"] == score
    assert call_kwargs["score_trace_url"] is None

@patch("app.crud.evaluations.core.update_evaluation_run")
@patch("app.crud.evaluations.core.get_evaluation_run_by_id")
@patch("app.core.storage_utils.upload_jsonl_to_object_store")
@patch("app.core.cloud.storage.get_cloud_storage")
@patch("app.core.db.engine")
def test_no_s3_upload_when_no_traces(
    self,
    mock_engine,
    mock_get_storage,
    mock_upload,
    mock_get_eval,
    mock_update,
    mock_eval_run,
) -> None:
    """Verify S3 upload skipped when traces is empty."""
    mock_get_eval.return_value = mock_eval_run

    score = {"summary_scores": [{"name": "accuracy", "avg": 0.9}], "traces": []}

    save_score(eval_run_id=100, organization_id=1, project_id=1, score=score)

    mock_upload.assert_not_called()

⚠️ Potential issue | 🟡 Minor

Use a factory fixture and add parameter type hints.

Guidelines call for factory-style fixtures and typed parameters. Consider converting the fixture into a callable factory and annotating the mocked parameters.

✍️ Suggested update (apply pattern to other tests too)
+from typing import Callable
 from unittest.mock import MagicMock, patch
@@
 class TestSaveScoreS3Upload:
@@
     @pytest.fixture
-    def mock_eval_run(self):
+    def eval_run_factory(
+        self: "TestSaveScoreS3Upload",
+    ) -> Callable[[], EvaluationRun]:
         """Create a mock EvaluationRun."""
-        eval_run = MagicMock(spec=EvaluationRun)
-        eval_run.id = 100
-        return eval_run
+        def _factory() -> EvaluationRun:
+            eval_run = MagicMock(spec=EvaluationRun)
+            eval_run.id = 100
+            return eval_run
+
+        return _factory
@@
     def test_uploads_traces_to_s3_and_stores_summary_only(
-        self,
-        mock_engine,
-        mock_get_storage,
-        mock_upload,
-        mock_get_eval,
-        mock_update,
-        mock_eval_run,
+        self: "TestSaveScoreS3Upload",
+        mock_engine: MagicMock,
+        mock_get_storage: MagicMock,
+        mock_upload: MagicMock,
+        mock_get_eval: MagicMock,
+        mock_update: MagicMock,
+        eval_run_factory: Callable[[], EvaluationRun],
     ) -> None:
@@
-        mock_get_eval.return_value = mock_eval_run
+        mock_get_eval.return_value = eval_run_factory()

As per coding guidelines, **/*.py: Always add type hints to all function parameters and return values in Python code; backend/app/tests/**/*.py: Use factory pattern for test fixtures in backend/app/tests/.

🤖 Prompt for AI Agents
In `@backend/app/tests/crud/evaluations/test_score_storage.py` around lines 14 -
110, The fixture mock_eval_run should be a factory (callable) and the test
functions need parameter type hints; change the fixture name to
mock_eval_run_factory (or make mock_eval_run return a Callable[[], MagicMock])
that when called creates/returns a MagicMock(spec=EvaluationRun) with id=100,
then inside tests call mock_eval_run_factory() to get the eval_run; also add
type annotations to each test parameter (e.g., mock_engine: MagicMock,
mock_get_storage: MagicMock, mock_upload: MagicMock, mock_get_eval: MagicMock,
mock_update: MagicMock, mock_eval_run_factory: Callable[[], EvaluationRun] or
MagicMock) and return type -> None for
test_uploads_traces_to_s3_and_stores_summary_only,
test_fallback_to_db_when_s3_fails, and test_no_s3_upload_when_no_traces; keep
references to save_score and update_evaluation_run/get_evaluation_run_by_id
unchanged.

@vprashrex vprashrex requested review from AkhileshNegi and removed request for Prajna1999 February 6, 2026 04:25
Comment on lines 192 to 193
from app.core.cloud.storage import get_cloud_storage
from app.core.storage_utils import load_json_from_object_store
Collaborator

We follow PEP 8, where all imports are placed at the top of the file rather than above a function in the middle of the file.

Collaborator (Author)

Okay I will update this

- add separate JSONL and JSON uploads to object storage, supporting both OpenAI batch responses and Langfuse traces

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@backend/app/alembic/versions/044_add_score_trace_url_to_evaluation_run.py`:
- Around line 1-7: The module docstring's Revision ID ("Revision ID: 043") is
out of sync with the migration metadata; update the docstring to "Revision ID:
044" so it matches the module-level revision variable named revision (and keep
the existing down_revision value unchanged); locate the top-of-file docstring
and replace the incorrect ID to ensure the header and the revision variable
(revision = "044") are consistent.

In `@backend/app/core/storage_utils.py`:
- Around line 158-170: The return type of load_json_from_object_store is too
narrow: json.loads can return dict, list, str, int, float, bool, or None; update
the function signature (load_json_from_object_store) to a broader typing such as
dict | list | None or typing.Any/typing.Union[...] as appropriate for callers,
and adjust any related type hints/imports if needed so the declared return type
matches json.loads' possible return types.
- Around line 114-120: Remove the trailing whitespace and wrap the long
assignment so it complies with Black/line-length rules: in the branch where
format == "jsonl" update the jsonl_content and content_type assignments
(referencing jsonl_content, content_type, format and results) to avoid a single
long line—e.g., build the "\n".join(...) expression across multiple lines or
assign the generator to an intermediate variable and then join, and ensure there
is no trailing space at the end of the line; do the same style-wise for the else
branch (json.dumps(results, ensure_ascii=False)) to keep lines under the limit.
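The two storage_utils prompts above can be sketched together: broadening the return annotation to match what `json.loads` can yield, and wrapping the JSONL serialization across multiple lines. This is a minimal sketch under assumed, simplified signatures; `serialize_results` and `parse_downloaded_json` are illustrative stand-ins for the upload helper and `load_json_from_object_store`, not the actual functions:

```python
import json
from typing import Any


def serialize_results(
    results: list[dict], output_format: str = "jsonl"
) -> tuple[str, str]:
    """Serialize results for upload, returning (content, content_type)."""
    if output_format == "jsonl":
        # One JSON object per line; the join is wrapped so the
        # assignment stays within Black's line-length limit.
        content = "\n".join(
            json.dumps(item, ensure_ascii=False) for item in results
        )
        content_type = "application/x-ndjson"
    else:
        content = json.dumps(results, ensure_ascii=False)
        content_type = "application/json"
    return content, content_type


def parse_downloaded_json(raw: bytes) -> Any:
    """json.loads can return dict, list, str, int, float, bool, or None,
    so the annotation is Any rather than the narrower dict."""
    return json.loads(raw)
```
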

In `@backend/app/crud/evaluations/core.py`:
- Around line 387-388: Fix the typo in the TODO comment: change "weather" to
"whether" in the comment line that starts with "# TODO: Evaluate weather this
behaviour..." located near the "# IF TRACES DATA IS STORED IN S3 URL..." block
in evaluations core (the TODO comment referencing storing data in db).
🧹 Nitpick comments (2)
backend/app/core/storage_utils.py (1)

88-88: Parameter format shadows the Python built-in.

Consider renaming to output_format or file_format to avoid shadowing the built-in format() function. Also, the docstring (lines 90–105) doesn't document the new format parameter.

backend/app/crud/evaluations/core.py (1)

190-208: Docstring doesn't document the new score_trace_url parameter.

Line 190 adds score_trace_url to the signature, but the Args section (lines 200–207) omits it.

Proposed fix
         object_store_url: New object store URL (optional)
+        score_trace_url: S3 URL for per-trace score data (optional)
         score: New score dict (optional)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/tests/crud/evaluations/test_score_storage.py`:
- Around line 94-110: Add an assertion that update_evaluation_run (mock_update)
was called with the expected payload when traces is empty: after calling
save_score(...) assert mock_update.assert_called_once() and that it was called
with eval_run_id=100, organization_id=1, project_id=1 and an update dict
containing the same score object and "score_trace_url": None (or equivalent key
used in the code); reference the test function test_no_s3_upload_when_no_traces
and the save_score call to locate where to insert these assertions.

Comment on lines +94 to +110
def test_no_s3_upload_when_no_traces(
self,
mock_engine,
mock_get_storage,
mock_upload,
mock_get_eval,
mock_update,
mock_eval_run,
) -> None:
"""Verify S3 upload skipped when traces is empty."""
mock_get_eval.return_value = mock_eval_run

score = {"summary_scores": [{"name": "accuracy", "avg": 0.9}], "traces": []}

save_score(eval_run_id=100, organization_id=1, project_id=1, score=score)

mock_upload.assert_not_called()

⚠️ Potential issue | 🟡 Minor

Add assertions for the DB update when traces are empty.

The test verifies upload is not called, but doesn't assert what update_evaluation_run receives. Without that, there's no coverage for whether the score is stored correctly (and whether score_trace_url is None) in the no-traces path.

✍️ Suggested addition
         save_score(eval_run_id=100, organization_id=1, project_id=1, score=score)
 
         mock_upload.assert_not_called()
+
+        # Verify DB update is called with the full score and no trace URL
+        call_kwargs = mock_update.call_args.kwargs
+        assert call_kwargs["score"] == score
+        assert call_kwargs["score_trace_url"] is None
🤖 Prompt for AI Agents
In `@backend/app/tests/crud/evaluations/test_score_storage.py` around lines 94 -
110, Add an assertion that update_evaluation_run (mock_update) was called with
the expected payload when traces is empty: after calling save_score(...) assert
mock_update.assert_called_once() and that it was called with eval_run_id=100,
organization_id=1, project_id=1 and an update dict containing the same score
object and "score_trace_url": None (or equivalent key used in the code);
reference the test function test_no_s3_upload_when_no_traces and the save_score
call to locate where to insert these assertions.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/evaluations/evaluation.py (1)

320-330: ⚠️ Potential issue | 🟡 Minor

save_score returning None silently discards the original eval_run.

If save_score returns None (e.g., the eval run isn't found in the new independent session), the previously fetched eval_run is overwritten and the caller receives (None, None) — indistinguishable from "evaluation not found". Consider preserving the original reference:

+    original_eval_run = eval_run
     eval_run = save_score(
         eval_run_id=eval_run_id,
         organization_id=organization_id,
         project_id=project_id,
         score=score,
     )

-    if eval_run:
-        eval_run.score = score
+    if eval_run:
+        eval_run.score = score
+    else:
+        logger.warning(
+            f"[get_evaluation_with_scores] save_score returned None | "
+            f"evaluation_id={evaluation_id}"
+        )
+        original_eval_run.score = score
+        eval_run = original_eval_run

     return eval_run, None
🤖 Fix all issues with AI agents
In `@backend/app/crud/evaluations/core.py`:
- Around line 353-401: The resync can leave a stale S3 URL in the DB because
when traces are empty score_trace_url remains None and update_evaluation_run
won't clear the existing value; in save_score, detect the case of empty traces
on resync and explicitly clear the stored URL (e.g., set score_trace_url = "" or
add a separate clear flag) before calling update_evaluation_run so the DB entry
is overwritten/cleared; update the logic that builds db_score and the arguments
passed to update_evaluation_run (referencing save_score, score_trace_url,
upload_jsonl_to_object_store, and update_evaluation_run) to ensure an explicit
clear is persisted when traces are absent.

In `@backend/app/tests/services/evaluations/test_evaluation_service_s3.py`:
- Around line 77-78: The test is patching get_cloud_storage at the wrong import
path; change the second decorator to patch get_cloud_storage where the
evaluation service imports it (use
"app.services.evaluations.evaluation.get_cloud_storage" to match the module
under test) so that the mock is applied correctly to the EvaluationService code;
also verify the patched parameter order in the test function matches the
decorator order for get_evaluation_run_by_id and get_cloud_storage.
- Around line 109-114: The test is patching
app.core.storage_utils.load_json_from_object_store and
app.core.cloud.storage.get_cloud_storage directly, but you must patch the names
as they are imported/used in the module under test; change those two decorators
to patch "app.services.evaluations.evaluation.load_json_from_object_store" and
"app.services.evaluations.evaluation.get_cloud_storage" so the mocks replace the
references used by the functions under test (referencing
load_json_from_object_store and get_cloud_storage in the evaluation module).
- Around line 40-48: The test patches are targeting the original modules instead
of the names imported into evaluation.py, so update the `@patch` decorators to
patch the symbols as looked up by app.services.evaluations.evaluation: replace
patch("app.core.storage_utils.load_json_from_object_store") with
patch("app.services.evaluations.evaluation.load_json_from_object_store"),
patch("app.core.cloud.storage.get_cloud_storage") with
patch("app.services.evaluations.evaluation.get_cloud_storage"), and likewise
ensure get_evaluation_run_by_id is patched as
patch("app.services.evaluations.evaluation.get_evaluation_run_by_id") so mocks
for load_json_from_object_store, get_cloud_storage, and get_evaluation_run_by_id
intercept calls inside the load_traces logic in evaluation.py.
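The patch-where-it's-used rule these prompts apply is standard `unittest.mock` guidance: a `from x import y` binds the name into the importing module, so patching the defining module has no effect on that binding. A self-contained illustration with two throwaway modules (`fake_storage` and `fake_evaluation` are invented names standing in for `app.core.cloud.storage` and `app.services.evaluations.evaluation`):

```python
import sys
import types
from unittest.mock import patch

# Simulate evaluation.py doing `from app.core.cloud.storage import get_cloud_storage`.
storage = types.ModuleType("fake_storage")
storage.get_cloud_storage = lambda: "real"
sys.modules["fake_storage"] = storage

evaluation = types.ModuleType("fake_evaluation")
evaluation.get_cloud_storage = storage.get_cloud_storage  # bound at import time
sys.modules["fake_evaluation"] = evaluation

# Patching the defining module does NOT affect the already-imported binding...
with patch("fake_storage.get_cloud_storage", return_value="mocked"):
    assert evaluation.get_cloud_storage() == "real"

# ...patching the name where it is used does.
with patch("fake_evaluation.get_cloud_storage", return_value="mocked"):
    assert evaluation.get_cloud_storage() == "mocked"
```
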
🧹 Nitpick comments (4)
backend/app/core/storage_utils.py (2)

119-127: Variable name jsonl_content is misleading when format="json".

When format="json", the content is a JSON array, not JSONL. The variable jsonl_content (line 126) is confusing in this branch. Consider renaming to something format-agnostic like content or serialized_content.


165-193: Add a docstring to load_json_from_object_store.

Every other public function in this module has a docstring with Args/Returns/Note sections. This function should follow the same pattern for consistency.
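A docstring following that pattern might look like the sketch below. The signature and the fallback behavior described in the Note are assumptions based on the review comments, not the actual implementation:

```python
def load_json_from_object_store(storage, url):  # hypothetical signature
    """Load and parse a JSON document from the object store.

    Args:
        storage: Cloud storage client used to fetch the object.
        url: Object-store URL of the JSON or JSONL document.

    Returns:
        The parsed JSON value, or None if the object could not be loaded.

    Note:
        Errors are logged and swallowed so callers can fall back to the
        score stored in the database.
    """
    ...
```
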

backend/app/crud/evaluations/core.py (1)

198-228: score_trace_url parameter undocumented in docstring.

The score_trace_url parameter was added to the signature (line 194) but is missing from the Args: section of the docstring (line 210 area).

backend/app/tests/services/evaluations/test_evaluation_service_s3.py (1)

3-5: Prefer str | None over Optional[str] for Python 3.11+.

The Optional import (line 4) and usage (lines 25-27) can be replaced with the native str | None union syntax, consistent with the rest of the codebase.

As per coding guidelines, **/*.py: Use Python 3.11+ with type hints throughout the codebase.

@vprashrex vprashrex merged commit 7ca3cca into main Feb 6, 2026
3 checks passed
@vprashrex vprashrex deleted the feat/eval-results-s3-storage branch February 6, 2026 13:13

Labels

enhancement New feature or request ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Evaluation: Optimize fetch results

2 participants